ABSTRACT

In this paper we focus on the problem of designing a collective
of autonomous agents that individually learn sequences
of actions such that the resultant sequence of joint actions
achieves a predetermined global objective. We are particularly
interested in instances of this problem where centralized
control is either impossible or impractical. For single
agent systems in similar domains, machine learning methods
(e.g., reinforcement learners [18]) have been successfully
used [16, 2, 3, 31]. However, applying such solutions directly
to multi-agent systems often proves problematic, as agents
may work at cross-purposes, or have difficulty evaluating
their contribution to achievement of the global objective,
or both. Accordingly, the crucial design step in multi-agent
systems centers on determining the private objectives
of each agent so that as the agents strive for those objectives,
the system reaches a good global solution. In this
work we consider a version of this problem involving multiple
autonomous agents in a grid world. We use concepts
from collective intelligence [19, 27, 30] to design goals for
the agents that are “aligned” with the global goal, and are
“learnable” in that agents can readily see how their behavior
affects their utility. We show that agents using those goals
outperform both “natural” extensions of single agent
algorithms and global reinforcement learning solutions
based on “team games”.

1. INTRODUCTION

Many challenging problems involve coordinating a large
number of autonomous agents to collectively accomplish a
defined global, time-dependent task. Examples of such
problems include controlling constellations of satellites,
constructing distributed algorithms, routing over a data
network, and controlling a collection of planetary exploration
vehicles (e.g., rovers on Mars, or submersibles under
Europa’s ice caps). In each case, using a centralized controller
is impractical (or in some cases impossible). For such
problems there are two fundamental issues that need to be
addressed:

• ensuring that the agents learn a sequence of actions
that optimize each agent’s “payoff utility” function
(i.e., achieve a private goal); and

• ensuring that, as far as the provided “world utility
function” is concerned, the agents do not work at
cross-purposes (i.e., making sure that the private goals of
the agents and the global goal are aligned).

For single agent systems, the first of these issues has been
dealt with extensively, and there are many learning systems
(e.g., Q-learners [21]) that have successfully been applied to
real world problems [2]. The second problem has received
less attention, and generally the solution consists of either
each agent receiving the world utility as their payoff utility
(e.g., “team” games [4]), or of imposing external mechanisms
(e.g., contracts, auctions) that encourage the agents to work
together.

Addressing these two issues simultaneously is one of the
main problems in designing multi-agent systems [16]. If
the agents are not designed to work well with each other,
they may not learn their task properly, may interfere with
each other’s ability to contribute to the world utility, or
simply perform useless repetitive work. Hand tailoring the
agents’ payoff functions may offer an alternative, but such
systems: (i) have to be laboriously modeled; (ii) provide
“brittle” global performance; (iii) are not “adaptive” to
changing environments; and (iv) generally do not scale well.

To sidestep these problems, yet address the design requirements
listed above (i.e., “alignedness” and “learnability”),
one can use the “COllective INtelligence” (COIN) framework
[19, 28]. A COIN is a large multi-agent system where
there is a well-defined “world utility” function which rates
the possible dynamic histories of the collection and where
there is little to no centralized control. We are particularly
interested in the case where each agent is selfish and runs
a Reinforcement Learning (RL) algorithm [16].

Given this framework, COIN theory addresses a new design
problem: Assuming the individual agents are able to
maximize their own utility functions (e.g., through reinforcement
learning), what set of payoff utilities for the individual
agents will, when pursued by those agents, result in high
world utility? In other words, how can we leverage an
assumption that our learners are individually fairly good at
what they do, to induce good collective behavior?

There are two quantifiable properties (discussed in detail
in Section 2) that help answer this question. First, the utility
functions for the individual agents need to be “aligned” with
the world utility, in that an action taken by an agent that 
improves its payoff utility also improves the world utility. 
Second, the utility functions need to be “learnable” in that 
an agent has to be able to discern the effect of its actions on 
its utility and select actions that optimize that utility. As 
we will highlight below, COIN theory provides utilities for 
individual agents that maximize the second property while 
satisfying the first one. 

A canonical example of a naturally occurring system that 
can be viewed as a COIN is a human economy. One can 
take the agents to be the individuals trying to maximize 
their payoff utilities (e.g., maximize bank account, advance 
career). One might then take the time average of the gross 
domestic product as the world utility ( “world utility” is not a 
construction internal to a human economy, but rather some- 
thing defined from the outside). To achieve high world util- 
ity it is necessary to avoid having the agents work at cross- 
purposes lest frustrational phenomena like the tragedy of 
the commons occur, in which individual avarice works to 
lower world utility [8]. One way to avoid such phenomena is
by modifying the agents’ utility functions via punitive leg- 
islation, in essence making sure the agents’ utility functions 
are aligned with the world utility. Securities and Exchange 
Commission (SEC) regulations designed to prevent insider 
trading can be viewed as a real world example of an attempt 
to make such a modification to the agents’ utilities. For ex- 
ample, a trade that once may have added to your wealth 
while hurting the economy, may now lead to your prosecu- 
tion. You are therefore unlikely to make such a trade. Your 
utility and the world utility have become more aligned. 

In designing a COIN we have more freedom than the SEC 
though, in that there is no base-line “organic” payoff util- 
ity function over which we must superimpose legislation-like 
incentives. Rather, the entire “psychology” of the individual
agents is at our disposal when designing a COIN. This
freedom is a major strength of the COIN approach, in that
it obviates the need for honesty-elicitation mechanisms, like 
auctions, which form a central component of conventional 
economics. 

The COIN design problem is related to work in many 
fields beyond multiagent systems and computational eco- 
nomics, including mechanism design, reinforcement learn- 
ing for adaptive control, computational ecologies, and game 
theory. However none of these fields directly addresses the 
inverse problem of how to design the agents’ utilities to reach 
a desirable world utility value in its full generality. This is 
even true for the field of mechanism design, which while 
addressing an inverse problem similar to that of COIN design,
does so only for certain restricted domains, and does
not address the “learnability” issue. (Mechanism design is
mostly appropriate when there are pre-specified underlying
agents’ utilities over which “incentives” need to be provided, 
and when Pareto-optimality (rather than optimization of a 
world utility) is often the goal [27].) 

The COIN framework has been successfully applied to 
multiple domains including packet routing over a data network [29]
and the congestion game known as Arthur’s El Farol Bar
problem [30]. In particular, in the routing domain,
the COIN approach achieved performance improvements of 
a factor of three over the conventional Shortest Path Al- 
gorithm (SPA) routing algorithms currently running on the 
internet [26], and avoided the Braess’ routing paradox which 
plagues the SPA-based systems [19]. 

In the work described above, agents were concerned with 
optimizing “rewards” (i.e., utility value of a single time 
step). In this paper we extend these results to a problem 
where agents need to optimize a time-extended utility func- 
tion through selecting sequences of actions. We show that 
in this significantly more complex domain, agents that use 
COIN theory-based utilities provided solutions that are sig- 
nificantly superior to agents that either use team games or 
“natural” utilities. In Section 2, we provide some back- 
ground on COIN-theoretic concepts and highlight relevant 
theoretical developments. In Section 3, we describe the 
problem domain and develop the COIN solution to this 
problem. In Section 4, we present and discuss the simulation 
results. Finally, in Section 5, we provide a simple example
that demonstrates how and why COIN-theory based algorithms
significantly outperformed more “natural” or “traditional”
approaches.

2. BACKGROUND: COLLECTIVE INTELLIGENCE

In this section, we summarize the portion of COIN theory
necessary and sufficient to describe the learning of sequences
of actions in a distributed system [24]. Let $Z$ be an arbitrary
vector space whose elements $\zeta$ give the joint move of
all agents in the system (i.e., $\zeta$ specifies the full “worldline”
consisting of the actions/states of all the agents). The provided
world utility, $G(\zeta)$, is a function of the full worldline,
and we wish to search for the $\zeta$ that maximizes $G(\zeta)$.

In addition to $G$, for each agent $\eta$ there is a payoff utility
function $g_\eta$. The agents will act to improve their
individual payoff functions, even though we, as system designers,
are only concerned with the value of the world utility
$G$. To specify all agents other than $\eta$, we will use the
notation $\hat{\eta}$.

2.1 Intelligence 

We need to have a way to “standardize” utility functions
so that the numeric value they assign to a $\zeta$ only reflects the
ranking of $\zeta$ relative to certain other elements of $Z$. We call
such a standardization of some arbitrary utility $U$ for agent
$\eta$ the “intelligence for $\eta$ at $\zeta$ with respect to $U$”. Here we
use intelligences that are equivalent to percentiles:

$$\epsilon_U(\zeta : \eta) \equiv \int d\mu_{\zeta_{\hat{\eta}}}(\zeta')\, \Theta[U(\zeta) - U(\zeta')], \qquad (1)$$

where the Heaviside function $\Theta$ is defined to equal 1 when
its argument is greater than or equal to 0, and to equal
0 otherwise, and where the subscript on the (normalized)
measure $d\mu$ indicates it is restricted to $\zeta'$ sharing the same
non-$\eta$ components as $\zeta$.¹ Note that intelligence values are
always between 0 and 1. Intuitively, intelligence values indicate
what percentage of $\eta$’s actions would have resulted in
lower utility. Accordingly, $\epsilon_{g_\eta}(\zeta : \eta) = 1$ means that agent $\eta$
is fully rational at $\zeta$, in that its move maximizes its payoff,
given the moves of other agents. Figure 1 shows an example
where 60% of $\eta$’s actions would have resulted in worse
utility, giving $\eta$ an intelligence of 0.6 at that point ($\zeta$).

¹ The measure must reflect the type of system at hand, e.g.,
whether $Z$ is countable or not, and if not, what coordinate
system is being used. Other than that, any convenient choice
of measure may be used and the theorems will still hold.




Figure 1: Intelligence of agent $\eta$ at state $\zeta$ for utility
$U$. $\zeta$ is the actual joint move at hand. The x-axis
shows agent $\eta$’s alternative possible moves (all states
$\zeta'$ having $\zeta$’s values for the moves of all agents other
than $\eta$). The bold lines show the alternative moves
that $\eta$ could have made that would have given $\eta$ a
worse value of the utility $U$. The fraction of those
bold lines to the full set of $\eta$’s possible moves (which
is 0.6 in this example) is the intelligence of agent $\eta$
at $\zeta$ for utility $U$, denoted by $\epsilon_U(\zeta : \eta)$.
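
The percentile reading of intelligence can be made concrete for a finite move set. The following is a minimal sketch, assuming a uniform measure over the agent's alternative moves; the function names and the toy utility are hypothetical and chosen only for illustration.

```python
def intelligence(utility, joint_move, agent, alternatives):
    """Fraction of the agent's alternative moves that would not have beaten
    the move actually taken, with all other agents' moves held fixed.
    Assumes a uniform (normalized) measure over `alternatives`."""
    actual = utility(joint_move)
    not_better = 0
    for move in alternatives:
        candidate = list(joint_move)
        candidate[agent] = move          # change only this agent's component
        # Heaviside step: counts 1 when U(actual) - U(candidate) >= 0
        if actual - utility(tuple(candidate)) >= 0:
            not_better += 1
    return not_better / len(alternatives)


def toy_utility(joint_move):
    """Hypothetical utility: product of the two agents' moves."""
    return joint_move[0] * joint_move[1]


moves = [0, 1, 2, 3, 4]
partial = intelligence(toy_utility, (2, 1), 0, moves)   # 3 of 5 moves are no better
rational = intelligence(toy_utility, (4, 1), 0, moves)  # best available move
```

In this toy case the agent that plays its best available move has intelligence 1, which is the "fully rational" case described in the text.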

Our uncertainty concerning the behavior of the system is 
reflected in a probability distribution over Z. Our ability 
to control the system consists of setting the value of some 
characteristic of the collection of agents, e.g., setting the 
payoff functions of the agents. Indicating that value by s, 
our analysis revolves around the following central equation 
for P(G | s), which follows from Bayes’ theorem: 

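$$P(G \mid s) = \int d\vec{\epsilon}_G\, P(G \mid \vec{\epsilon}_G, s) \int d\vec{\epsilon}_g\, P(\vec{\epsilon}_G \mid \vec{\epsilon}_g, s)\, P(\vec{\epsilon}_g \mid s), \qquad (2)$$

where $\vec{\epsilon}_g \equiv (\epsilon_{g_{\eta_1}}(\zeta : \eta_1), \epsilon_{g_{\eta_2}}(\zeta : \eta_2), \ldots)$ is the vector of
intelligences of the agents with respect to their associated
payoff functions, and $\vec{\epsilon}_G \equiv (\epsilon_G(\zeta : \eta_1), \epsilon_G(\zeta : \eta_2), \ldots)$ is the
vector of the intelligences of the agents with respect to $G$.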

Note that, from a game-theoretic perspective, a point $\zeta$
where all players are rational ($\epsilon_{g_\eta}(\zeta : \eta) = 1$ for all agents $\eta$)
is a game theory Nash equilibrium [5]. On the other hand,
a $\zeta$ at which all components of $\vec{\epsilon}_G = 1$ is a local maximum
of $G$ (or more precisely, a critical point of the $G(\zeta)$ surface).

If we can choose $s$ so that the third conditional probability
in the integrand, $P(\vec{\epsilon}_g \mid s)$, is peaked around vectors $\vec{\epsilon}_g$ all
of whose components are close to 1 (that is, agents are able
to “learn” their tasks), then we have likely induced large
payoff utility intelligences. If we can also have the second
term, $P(\vec{\epsilon}_G \mid \vec{\epsilon}_g, s)$, be peaked about $\vec{\epsilon}_G$ equal to $\vec{\epsilon}_g$ (that
is, the payoff and world utilities are aligned), then $\vec{\epsilon}_G$ will
also be large. Finally, if the first term in the integrand,
$P(G \mid \vec{\epsilon}_G, s)$, is peaked about high $G$ when $\vec{\epsilon}_G$ is large, then
our choice of $s$ will likely result in high $G$, as desired.

2.2 Factoredness and Learnability 

The requirement that payoff functions have high “signal- 
to-noise” (an issue not considered in conventional work in 
mechanism design) arises in the third term. It is in the
second term that the requirement that the payoff functions
be “aligned with G” arises. In this work we concentrate on
these two terms, and show how to simultaneously set them
to have the desired form.²

Details of the stochastic environment in which the collec- 
tion of agents operate, together with details of the learning 
algorithms of the agents, are reflected in the distribution 
$P(\zeta)$ which underlies the distributions appearing in Equation 2.
Note though that independent of these considerations,
our desired form for the second term in Equation 2 is
assured if we have chosen payoff utilities such that $\vec{\epsilon}_g$ equals
$\vec{\epsilon}_G$ exactly for all $\zeta$. We call such a system factored. In
game theory language, the Nash equilibria of a factored sys- 
tem are local maxima of G. In addition to this desirable 
equilibrium behavior, factored systems also automatically 
provide appropriate off-equilibrium incentives to the agents 
(an issue rarely considered in the game theory / mechanism 
design literature). 

As a trivial example, any “team game” in which all the
payoff functions equal $G$ is factored [4, 13]. However, team
games often have very poor forms for term 3 in Equation 2,
forms which get progressively worse as the size of the system
grows. This is because for large systems where $G$ sensitively
depends on all components of the system, each agent may
experience difficulty discerning the effects of its actions on
$G$. As a consequence, each $\eta$ may have difficulty achieving
high $g_\eta$ in a team game. We can quantify this signal/noise
effect by comparing the ramifications on $g_\eta(\zeta)$ arising from
changes to $\zeta_\eta$ with the ramifications arising from changes to
$\zeta_{\hat{\eta}}$ (i.e., changes to all nodes other than $\eta$). We call this
quantification the differential learnability of payoff utility
$g_\eta$, in the vicinity of $\zeta$ [27]:

$$\lambda_{\eta, g_\eta}(\zeta) \equiv \frac{\|\nabla_{\zeta_\eta} g_\eta(\zeta)\|}{\|\nabla_{\zeta_{\hat{\eta}}} g_\eta(\zeta)\|}. \qquad (3)$$



The denominator in Equation 3 reflects how sensitive $g_\eta(\zeta)$
is to changes in $\zeta_{\hat{\eta}}$, i.e., changes to agents other than $\eta$. In
contrast, the numerator reflects how sensitive $g_\eta(\zeta)$ is to
changing $\zeta_\eta$. So at a given state $\zeta$, the higher the learnability,
the more $g_\eta(\zeta)$ depends on the move of agent $\eta$, i.e., the
better the associated signal-to-noise ratio for $\eta$. Intuitively
then, higher learnability means it is easier for $\eta$ to achieve
a large value of its intelligence.

It can be proven that in many circumstances, especially in
large problems, $\lambda_{\eta,WLU}(\zeta) > \lambda_{\eta,G}(\zeta)$, i.e., WLU has higher
differential learnability than does the team game choice of
payoff utilities [27]. This is mainly due to the second term
of WLU, which removes a lot of the effect of other agents
(i.e., noise) from $\eta$’s utility. The result is that convergence
to optimal $G$ with WLU is much quicker (up to orders of
magnitude so [27, 28]) than with a team game.
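
The ratio in Equation 3 can be estimated numerically. The sketch below is illustrative only: it assumes real-valued moves, uses central finite differences for the gradients, and compares a team game payoff with the Wonderful Life Utility of Section 2.3 (clamped to zero) on a hypothetical quadratic world utility. None of these specific choices come from the paper.

```python
import math


def gradient_norm(f, z, indices, h=1e-5):
    """Euclidean norm of the partial derivatives of f at z, restricted to
    the given coordinates, estimated with central differences."""
    total = 0.0
    for i in indices:
        up, down = list(z), list(z)
        up[i] += h
        down[i] -= h
        total += ((f(up) - f(down)) / (2.0 * h)) ** 2
    return math.sqrt(total)


def differential_learnability(payoff, z, agent):
    """Sensitivity to the agent's own move divided by sensitivity to all
    other agents' moves (signal over noise)."""
    others = [i for i in range(len(z)) if i != agent]
    signal = gradient_norm(payoff, z, [agent])
    noise = gradient_norm(payoff, z, others)
    return math.inf if noise == 0.0 else signal / noise


def world_utility(z):
    """Hypothetical world utility with diminishing returns in total activity."""
    total = sum(z)
    return total - 0.05 * total ** 2


def wonderful_life_utility(z, agent=0, clamp=0.0):
    """G(z) minus G with the agent's move clamped to a fixed value."""
    clamped = list(z)
    clamped[agent] = clamp
    return world_utility(z) - world_utility(clamped)


state = [1.0] * 5
team_learnability = differential_learnability(world_utility, state, 0)
wlu_learnability = differential_learnability(wonderful_life_utility, state, 0)
```

For this toy utility the clamped term cancels most of the dependence on the other agents, so the WLU ratio comes out several times larger than the team game ratio.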


2.3 Difference Utilities 

It is possible to solve for the set of all payoff utilities that 


² Non-game-theory-based function maximization techniques
like simulated annealing instead address how to have term
1 have the desired form. They do this by trying to ensure
that the local maxima that the underlying system ultimately
settles near have high G, by “trading off exploration and
exploitation”. One can combine such term-1-based techniques
with the techniques presented here. The resultant hybrid
algorithm, addressing all three terms, outperforms simulated
annealing by over two orders of magnitude [ 2 o j. 



are factored with respect to a particular world utility. Un- 
fortunately, in general it is not possible for a system both 
to be factored and to have infinite learnability (i.e., no
dependence of any $g_\eta$ on any agent other than $\eta$) for all of its
agents [24]. However, consider difference utilities, which
are of the form:

$$U(\zeta) = G(\zeta) - \Gamma(f(\zeta)), \qquad (4)$$

where $\Gamma(f)$ is independent of $\zeta_\eta$. Such difference utilities
are factored [27]. In addition, under usually benign approximations,
the differential learnability can be maximized over
the set of difference utilities by choosing $f = G$ and setting
$\Gamma$ to the expected value operator [27]. We call the resultant
difference utility the Aristocrat Utility (AU), loosely
reflecting the fact that it measures the difference between an
agent’s actual action and the average action:

$$AU(\zeta) = G(\zeta) - E[G \mid \zeta_{\hat{\eta}}]. \qquad (5)$$

If possible, we would like each agent $\eta$ to use the associated
AU as its payoff function to ensure good form for both
terms 2 and 3 in Equation 2. This is not always feasible 
however. The problem is that to evaluate the expectation 
value defining its AU each agent needs to evaluate the cur- 
rent probabilities of each of its potential moves. However if 
the agent then changes its payoff function to be the associ- 
ated AU it will in general substantially change its ensuing 
behavior. (The agent now wants to choose moves that max- 
imize a different function from the one it was maximizing 
before.) In other words, it will change the probabilities of 
its moves, which means that its new payoff function is in 
fact not the AU for its actual (new) probabilities. 

There are ways around this self-consistency problem, but
in practice it is often easier to bypass the entire issue, by
giving each $\eta$ a payoff function that does not depend on the
probabilities of $\eta$’s own moves. One such payoff function is
the Wonderful Life Utility (WLU). The WLU for agent $\eta$
is parameterized by a pre-fixed clamping parameter $CL_\eta$
chosen from among $\eta$’s legal or illegal moves:

$$WLU_\eta \equiv G(\zeta) - G(\zeta_{\hat{\eta}}, CL_\eta). \qquad (6)$$

WLU is factored no matter what the choice of clamping
parameter. Furthermore, while not matching the high
learnability of AU, WLU usually has far better learnability than
does a team game.
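
The claim that WLU is factored for any clamping parameter can be checked by brute force on a small discrete system. The sketch below is an illustration under assumptions: the congestion-style world utility, the three-agent, three-resource setup, and all function names are hypothetical. The check used is that every unilateral change of move changes the agent's WLU and the world utility in the same direction.

```python
from itertools import product


def world_utility(joint_move):
    """Hypothetical congestion utility: minus the sum of squared head counts
    per resource. A move of None means the agent is absent."""
    counts = {}
    for move in joint_move:
        if move is not None:
            counts[move] = counts.get(move, 0) + 1
    return -sum(c * c for c in counts.values())


def wonderful_life_utility(joint_move, agent, clamp):
    """G(joint move) minus G with the agent's move replaced by the clamp."""
    clamped = list(joint_move)
    clamped[agent] = clamp
    return world_utility(joint_move) - world_utility(tuple(clamped))


def sign(x):
    return (x > 0) - (x < 0)


def is_factored(agent, clamp, moves=(0, 1, 2), n_agents=3):
    """True if every unilateral change by `agent` moves its WLU and the
    world utility in the same direction, for every joint move."""
    for joint in product(moves, repeat=n_agents):
        for alternative in moves:
            changed = list(joint)
            changed[agent] = alternative
            changed = tuple(changed)
            d_world = world_utility(changed) - world_utility(joint)
            d_payoff = (wonderful_life_utility(changed, agent, clamp)
                        - wonderful_life_utility(joint, agent, clamp))
            if sign(d_world) != sign(d_payoff):
                return False
    return True


factored_with_null_clamp = is_factored(agent=0, clamp=None)
factored_with_legal_clamp = is_factored(agent=0, clamp=0)
```

Because the clamped term does not depend on the agent's own move, the two differences are identical in this example, which is why the check passes for both the "null" clamp and a legal-move clamp.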

Figure 2 provides an example of clamping. As in that example,
in many circumstances there is a particular choice
of clamping parameter for agent $\eta$ that is a “null” move
for that agent, equivalent to removing that agent from the
system, hence the name of this payoff function. For such
a clamping parameter WLU is closely related to the economics
technique of “endogenizing a player’s (agent’s)
externalities” [14]. Indeed, WLU has conceptual similarities to
Vickrey tolls [20] in economics, and Groves’ mechanism [7] in
mechanism design.³ However, because WLU can be applied
to arbitrary, time-extended utility functions, and need not
be restricted to the “null” clamping operator interpretable
in terms of “externality payments”, it can be viewed as a
generalization of these concepts.

³ Note also that Groves’ mechanism is restricted to public
resources where an agent’s use of that resource does not
affect the ability of other agents to use the resource (e.g., a
lighthouse) and therefore the tragedy of the commons does
not arise. Groves’ mechanism was especially formulated to
solve the problem of agents being untruthful in reporting
their utility for public goods, a problem not present in the
COIN framework.

Figure 2: This example shows the impact of the
clamping operation on the joint state of a four-agent
system where each agent has three possible
actions, and each such action is represented by a
three-dimensional unary vector. The first matrix
represents the joint state of the system $\zeta$ where
agent 1 has selected action 1, agent 2 has selected
action 3, agent 3 has selected action 1 and agent 4
has selected action 2. The second matrix displays
the effect of clamping agent 2’s action to the “null”
vector (i.e., replacing $\zeta_{\eta_2}$ with $\vec{0}$). The third
matrix shows the effect of instead clamping agent 2’s
move to the “average” action vector $\vec{a}$ = {.33, .33, .33},
which amounts to replacing that agent’s move with
the “illegal” move of fractionally taking each possible
move ($\zeta_{\eta_2} = \vec{a}$).
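
The clamping operation described in the Figure 2 caption can be written down directly. This is a small sketch, assuming the joint state is stored with one unary (one-hot) vector per agent; that storage layout and the helper names are assumptions made for illustration.

```python
def one_hot(action, n_actions):
    """Unary vector with a 1 in the position of the chosen action."""
    return [1.0 if i == action else 0.0 for i in range(n_actions)]


def clamp(joint_state, agent, clamp_vector):
    """Return a copy of the joint state with one agent's move replaced by
    the clamping vector; every other agent's move is left untouched."""
    clamped = [row[:] for row in joint_state]
    clamped[agent] = list(clamp_vector)
    return clamped


# Agents 1..4 choose actions 1, 3, 1, 2 (stored zero-based as 0, 2, 0, 1).
zeta = [one_hot(a, 3) for a in (0, 2, 0, 1)]

# Clamp agent 2 (index 1) to the "null" vector, i.e. remove it from the system.
zeta_null = clamp(zeta, 1, [0.0, 0.0, 0.0])

# Clamp agent 2 to the "average" (illegal, fractional) action vector.
zeta_average = clamp(zeta, 1, [1.0 / 3.0] * 3)
```

A WLU for agent 2 is then the world utility evaluated at `zeta` minus the world utility evaluated at one of the clamped states.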

For example, it is usually the case that using WLU with 
a clamping parameter that is as close as possible to the ex- 
pected action (and not the “null” action) results in higher 
learnability than does clamping to the null move. Such a
WLU is roughly akin to a mean-field approximation to AU.⁴
For example, in Fig. 2, if the probabilities of agent 2 making 
each of its possible moves was 1/3, then one would expect 
that a clamping parameter of a would be close to optimal. 
Accordingly, in practice, use of such an alternative WLU de- 
rived as a “mean-field approximation” to AU almost always 
results in better values of G than does the endogenizing 
WLU [28]. 
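
The relationship between AU and a WLU clamped to the expected action can be illustrated numerically. The sketch below rests on several assumptions that are not from the paper: world utilities that depend only on how many agents chose each action, uniform move probabilities for the agent in question, and hypothetical function names throughout.

```python
import math


def unit(action, n_actions):
    return [1.0 if i == action else 0.0 for i in range(n_actions)]


def column_totals(state):
    """Total (possibly fractional) number of agents taking each action."""
    return [sum(row[a] for row in state) for a in range(len(state[0]))]


def smooth_world_utility(state):
    """Hypothetical smooth, nonlinear congestion-style utility."""
    return sum(x * math.exp(-x / 3.0) for x in column_totals(state))


def linear_world_utility(state):
    """Hypothetical utility that is linear in the action totals."""
    return sum(w * x for w, x in zip([1.0, 2.0, 3.0], column_totals(state)))


def with_move(state, agent, vector):
    changed = [row[:] for row in state]
    changed[agent] = list(vector)
    return changed


def aristocrat_utility(G, state, agent, probabilities):
    """G at the actual move minus the expected G over the agent's moves."""
    n = len(probabilities)
    expected = sum(p * G(with_move(state, agent, unit(a, n)))
                   for a, p in enumerate(probabilities))
    return G(state) - expected


def wonderful_life_utility(G, state, agent, clamp_vector):
    """G at the actual move minus G with the agent's move clamped."""
    return G(state) - G(with_move(state, agent, clamp_vector))


state = [unit(a, 3) for a in (0, 2, 0, 1)]
uniform = [1.0 / 3.0] * 3

au = aristocrat_utility(smooth_world_utility, state, 1, uniform)
wlu_mean = wonderful_life_utility(smooth_world_utility, state, 1, uniform)
wlu_null = wonderful_life_utility(smooth_world_utility, state, 1, [0.0] * 3)

au_linear = aristocrat_utility(linear_world_utility, state, 1, uniform)
wlu_mean_linear = wonderful_life_utility(linear_world_utility, state, 1, uniform)
```

When the utility is linear in the agent's move vector, the expected utility equals the utility of the expected move, so the two quantities coincide; for the smooth nonlinear utility the mean-clamped WLU is only an approximation, but in this example it lies closer to AU than the null-clamped WLU does.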

Intuitively, one can look at AU and WLU from the perspective
of a human company, with $G$ the “bottom line”
of the company, the agents $\eta$ identified with the employees
of that company, and the associated $g_\eta$ given by the
employees’ performance-based compensation packages. For
example, for a “factored company”, each employee’s
compensation package contains incentives designed such that
the better the bottom line of the corporation, the greater
the employee’s compensation. As an example, the CEO of
a company wishing to have the payoff utilities of the employees
be factored with $G$ may give stock options to the
employees. The net effect of this action is to ensure that
what is good for the employee is also good for the company.
In addition, if the compensation packages have high
“learnability”, the employees will have a relatively easy time
discerning the relationship between their behavior and their
compensation. In such a case the employees will both have
the incentive to help the company and be able to determine
how best to do so. Note that in practice, providing stock
options is usually more effective in small companies than in
large ones. This makes perfect sense in terms of the COIN
formalism, since such options generally have higher learnability
in small companies than they do in large companies,
in which each employee has a hard time seeing how his/her
moves affect the company’s stock price.

⁴ Formally, our approximation is exact only if the expected
value of $G$ equals $G$ evaluated at the expected joint move
(both expectations being conditioned on given moves by
all agents other than $\eta$). In general though, for relatively
smooth $G$, we would expect such a mean-field approximation
to AU to give good results, even if the approximation
does not hold exactly [28].

3. MULTI-AGENT GRID WORLD PROBLEM 

3.1 Problem Description 

A common reinforcement learning problem is the Grid 
World Problem [18], where an agent navigates about a
two-dimensional n × n grid. At each time step, the agent can
move up, down, right or left one grid square, and receives a 
reward after each move. The observable state space for the 
agent is its grid coordinate and the reward it receives de- 
pends on the grid square to which it moves. In the episodic 
version, which is the focus of this paper, the agent moves 
for a fixed number of time steps, and then is returned to its 
starting location. This problem typically requires the use of 
a reinforcement learner that can optimize a sum of rewards 
in contrast to one that optimizes an immediate reward since 
the agent may have to cross squares of low reward value to 
enter the squares of high value. Q-learners or the Sarsa
algorithm [18] are often used for this problem. In this paper
we apply COIN theory to a multi-agent version of the
Grid World Problem. In this problem there are multiple
agents navigating the grid simultaneously, interacting with
each others’ rewards. This reward interaction is modeled
through the use of tokens that are distributed throughout
the grid squares of the grid world (Figure 3). Each token
has a value between zero and one, and each grid square can
have at most one token. When an agent moves into a grid
square it receives a reward for the value of the token and
then removes the token so that a reward will no longer be
received when an agent enters the grid square. However, all
the tokens are reset at the end of an episode. The global
objective of the Multi-agent Grid World Problem is to collect
the highest aggregated value of tokens in a fixed number of
time steps.

Figure 3: Agents collecting tokens of varying value.

The Multi-agent Grid World Problem is an idealized version
of many real world problems, including the control of
multiple planetary exploration vehicles (e.g., rovers on the
surface of Mars collecting rocks in an attempt to maximize
total scientific return, or submersibles under Europa examining
potential life signs). Furthermore, the agent interaction
provides a critical study of coordination and interference, as the
agents have the potential to work at cross-purposes. This
problem can also exhibit the tragedy of the commons [8],
where each agent attempting to maximize its own utility
can drive the world utility to severely sub-optimal values.
As such, the design of the payoff functions is crucial in this
problem, and we address this issue below.

3.2 COIN Solution 

To pose the Multi-agent Grid World Problem in the form 
of the COIN framework we need to define: 

• $L_{\eta,t}$: The location of agent $\eta$ at time $t$;

• $L_t$: The location of all agents at time $t$;

• $L$: The locations of all agents for all time;

• $L_{\hat{\eta}}$: The locations of all agents other than $\eta$ for all
time;

• $T$: The initial value and location of all tokens.

The space $Z$ is composed of $L$ and $T$, and the world line
$\zeta$ is a point in this space. We now define the function
$V(L_{\eta,t}, L, T)$ to return the value of a token picked when
an agent moves into location $L_{\eta,t}$. This function uses the
information from $L$ to determine whether a token in location
$L_{\eta,t}$ has already been picked up by time $t$, in which case
it returns 0 regardless of the value of the token at that
location.

The global utility $G(\zeta)$ is the sum of all the tokens picked
during an episode:

$$G(\zeta) = \sum_{\eta, t} V(L_{\eta,t}, L, T)$$

In our experiments we set the clamping parameter $CL_\eta$ to
a “null” location, so that clamping had the effect of removing
the agent from the world line. The WLU can now be defined
as:

$$WLU_\eta(\zeta) = G(\zeta) - \sum_{\eta' \neq \eta,\, t} V(L_{\eta',t}, L_{\hat{\eta}}, T)$$

The second term of the equation returns the world utility
in an episode without agent $\eta$; therefore the entire WLU
returns an agent’s contribution to the global collection of
tokens.

We now define the two single time step rewards. The
world reward is the sum of all the tokens picked up at a
single time step $t$:

$$R_t(\zeta) = \sum_{\eta} V(L_{\eta,t}, L, T)$$

The WL reward is defined similarly to the WLU, except that
it uses the world reward $R_t$, rather than the world utility $G$:

$$WLR_{\eta,t}(\zeta) = R_t(\zeta) - \sum_{\eta' \neq \eta} V(L_{\eta',t}, L_{\hat{\eta}}, T)$$

Note that these two rewards are still a function of the
time-extended space, as $V$ is a function of $L$ and $T$.
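
The token function and the utilities built on it can be sketched in a few lines. The following is illustrative only and makes assumptions the text does not state: locations are integers on a one-dimensional strip, each agent's path lists the squares it enters at time steps 0, 1, ..., ties between agents entering a square at the same time step are broken in favor of the lower agent index, and the particular layout of tokens and paths is hypothetical.

```python
def token_value(agent, t, paths, tokens):
    """V: value collected by `agent` on entering its square at time t.
    Returns 0 if any agent in `paths` emptied that square earlier
    (tie at the same time step: lower agent index collects, an assumption)."""
    location = paths[agent][t]
    for other, path in paths.items():
        for s, visited in enumerate(path[: t + 1]):
            if visited == location and (s < t or (s == t and other < agent)):
                return 0.0
    return tokens.get(location, 0.0)


def world_reward(t, paths, tokens):
    """R_t: tokens picked up by all agents at time step t."""
    return sum(token_value(a, t, paths, tokens) for a in paths)


def world_utility(paths, tokens):
    """G: tokens picked up by all agents over the whole episode."""
    return sum(token_value(a, t, paths, tokens)
               for a, path in paths.items() for t in range(len(path)))


def without(agent, paths):
    """Null clamping: drop the agent from the world line."""
    return {a: p for a, p in paths.items() if a != agent}


def wlu(agent, paths, tokens):
    return world_utility(paths, tokens) - world_utility(without(agent, paths), tokens)


def wl_reward(agent, t, paths, tokens):
    return world_reward(t, paths, tokens) - world_reward(t, without(agent, paths), tokens)


tokens = {0: 5.0, 3: 10.0}              # square -> token value (hypothetical)
paths = {1: [2, 3], 2: [3, 4]}          # agent 1 moves right twice, agent 2 moves left then right
better_paths = {1: [0, 1], 2: [3, 4]}   # agent 1 goes for the smaller token instead

g_first = world_utility(paths, tokens)
g_better = world_utility(better_paths, tokens)
wlr_agent2 = [wl_reward(2, t, paths, tokens) for t in range(2)]
```

In this toy layout agent 2's WL reward is positive on the first step and negative on the second, because removing agent 2 would have let agent 1 collect the same token one step later; agent 1 raises both its own WLU and the world utility by heading for the other token.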

The AU reward for this problem is given by:

$$AU_{\eta,t}(\zeta) = R_t(\zeta) - \frac{1}{4} \sum_{l \in N(L_{\eta,t-1})} \sum_{\eta'} V(L_{\eta',t}, L, T) \Big|_{L_{\eta,t} = l}$$

where $N(L_{\eta,t-1})$ gives the set of possible locations at time
step $t$, given a location at time step $t - 1$. Note, to avoid the
self-consistency issue discussed in Section 2, we use an
approximation to the expected value by averaging over the
last four possible directions, rather than estimate the correct
probabilities of taking each action (see Section 6).

4. RESULTS 

To evaluate the effectiveness of the COIN approach in the 
Multi-agent Grid World, we conducted experiments where 
the agents used four different utility functions. The first of 
these was the Selfish Utility (SU), where each agent receives 
the weighted total of the tokens that it alone collected. It is 
the natural extension of the single agent problem, and rep- 
resents the optimal utility in the single rover domain. The 
second utility was the Team Game (TG) utility where each 
agent received the full world utility. The third utility was 
the WLU, which represents the contribution an agent made 
to the token collection, by looking at the difference in the to- 
tal token collection with and without that agent. The fourth 
and final utility was AU, where the agent’s contribution is 
computed as the difference between the action it took and 
its expected action. 



Figure 4: Effect of Payoff Utility on System Perfor- 
mance (10 agents on a 10x10 grid). 

Figure 4 shows the results for 10 agents on a 100 unit- 
square grid where an episode consists of 10 time steps
(including error bars of ± one σ). The results showed that SU
produced poor results, results that were indeed worse than 
random actions. This is caused by all agents aiming to ac- 
quire the most valuable tokens, in effect competing rather 
than cooperating. The agents using TG fared better, but 
their learning was slow. This system was plagued by the
signal-to-noise problem associated with each agent receiving
the full world reward for each individual action they
took. Notice both the selfish agents and those trained with
TG had a drop in their performance in the early going, as
they learned the “wrong” actions. Team game agents overcame
this early setback whereas selfish agents never did. In
contrast, agents using WLU and AU performed almost optimally,
because the reinforcement signal they received more
clearly showed how their actions affected the world reward.



Figure 5: Effect of Payoff Utility on System Performance
(100 rovers on a 32x32 grid). The x-axis shows the number
of training episodes.

Figure 6: Scaling Properties of Different Payoff Functions.

Figure 5 shows results for 100 agents on a 1024 unit-square 
grid where an episode consists of 32 time steps. Qualitatively,
the results are similar to the 10 agent case. However,
note that the team game agents have a harder time learning, 
because in this case the reinforcement signal is even further 
diluted. We explore this scaling issue in more detail in Fig- 
ure 6. With very few agents, the selfish learners did not 
compete with each other as much and were able to obtain 
acceptable results. Their performance, however, deteriorated
rapidly when the number of agents in the system increased.
Similarly, agents using the team game reward were not ham- 
pered as much by the noise associated with other agents 
when the number of agents was low. As the system scaled 
up however, only the WLR-trained agents were able to oper- 
ate collectively. This underscores the need for a utility that 
has good signal-to-noise properties so that the agents have 
an opportunity to learn the actions that will optimize their 
utilities. 
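
The dilution of the team game signal can be illustrated with a toy model that is not part of the experiments above. As a simplifying assumption, each agent's contribution to the world reward is drawn independently and uniformly, and the world reward is their sum. The function names `team_signal` and `difference_signal` are hypothetical. Under these assumptions the correlation between one agent's contribution and the team reward falls as 1/sqrt(N), while a reward formed by subtracting the world reward without that agent stays perfectly correlated.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def sample(n_agents, n_samples, seed):
    """Draw independent uniform contributions (toy assumption, not the grid world)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_agents)] for _ in range(n_samples)]


def team_signal(n_agents, n_samples=5000, seed=0):
    """Correlation between agent 0's contribution and the full team reward."""
    draws = sample(n_agents, n_samples, seed)
    return pearson([d[0] for d in draws], [sum(d) for d in draws])


def difference_signal(n_agents, n_samples=5000, seed=0):
    """Correlation between agent 0's contribution and (reward with it - reward without it)."""
    draws = sample(n_agents, n_samples, seed)
    return pearson([d[0] for d in draws], [sum(d) - sum(d[1:]) for d in draws])


if __name__ == "__main__":
    for n in (10, 100):
        print(n, round(team_signal(n), 3), round(difference_signal(n), 3))
```

In the grid world the difference is not simply the agent's own contribution, since agents interact through shared tokens, so this sketch only conveys why the team game reward becomes harder to learn from as the number of agents grows.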






5. WONDERFUL LIFE UTILITY FUNCTION EXAMPLE

In this simple example, we demonstrate how each agent
optimizing its Wonderful Life Utility will optimize the world
utility (Figure 7). Suppose that two agents are in a six-square
world, can move left or right, and can take actions for two
time steps. There are two tokens, one of value 5 and the other
of value 10, that the agents can pick up by entering the
appropriate square. A plausible, yet non-optimal, set of
actions consists of agent 1 moving right twice and agent 2
moving left first and then taking an arbitrary action. Note
that this is the set of actions that would be selected if agents
were using the selfish reward described in Section 3. In this
scenario agent 2 will pick up a token worth 10 on its first
time step and no tokens on the second time step. Agent 1
will not pick up any tokens. This results in a world reward
of 10 for the first time step and 0 for the second, resulting
in a world utility of 10.

Now, let us look at agent 2’s WL payoff for this set of
moves. For the first time step, the WL reward turns out to
be the same as the selfish reward: agent 2 receives 10 for
picking up the token. The WLR for agent 2 in the second
time step is more interesting. The second parameter of the
V function, L-„, now does not include agent 2, causing this
function to disregard any tokens agent 2 previously picked
up. This causes the V function to report that the token of
value 10 is still available in the second time step. Since agent
1 moves into the square with this token in the second time
step, it receives credit for picking up the token, meaning
the world reward without agent 2 is 10 for this time step.
Because the world reward with agent 2 was 0 (token picked
up in the previous time step), the WL reward for agent 2 for
the second time step is WLR_{2,t=2} = 0 - 10 = -10. The WLR
can be thought of as an agent’s contribution to the world
reward at this time step. Since at t=1 agent 2 picked up a
token that could have been picked up at t=2, it had a
deleterious effect on the second time step.
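
The per-step arithmetic above can be reproduced with a minimal sketch. The text does not give the positions of the agents and tokens, so the layout below is an assumption chosen to match the described moves: squares are numbered 0 to 5, agent 1 starts on square 0, agent 2 on square 3, the token of value 10 sits on square 2, and the token of value 5 on square 5. The helper names `world_rewards` and `wl_rewards` are hypothetical, and the "arbitrary" second move of agent 2 is taken to be another step left.

```python
N_SQUARES = 6
TOKENS = {2: 10, 5: 5}   # assumed layout: square index -> token value
STARTS = {1: 0, 2: 3}    # assumed start squares of agent 1 and agent 2


def path(start, moves):
    """Squares visited after each move (-1 = left, +1 = right), clipped to the world."""
    squares, pos = [], start
    for move in moves:
        pos = min(max(pos + move, 0), N_SQUARES - 1)
        squares.append(pos)
    return squares


def world_rewards(plans, without=None):
    """World reward at each time step.

    `without` removes one agent from the entire history, so tokens it
    picked up earlier are treated as still available.
    """
    remaining = dict(TOKENS)
    paths = [path(STARTS[a], m) for a, m in plans.items() if a != without]
    n_steps = len(next(iter(plans.values())))
    rewards = []
    for t in range(n_steps):
        rewards.append(sum(remaining.pop(p[t], 0) for p in paths))
    return rewards


def wl_rewards(plans, agent):
    """WLR per time step: world reward with the agent minus world reward without it."""
    with_agent = world_rewards(plans)
    without_agent = world_rewards(plans, without=agent)
    return [w - wo for w, wo in zip(with_agent, without_agent)]


# Agent 1 moves right twice; agent 2 moves left, then (arbitrarily) left again.
first_moves = {1: [+1, +1], 2: [-1, -1]}

print(world_rewards(first_moves))              # [10, 0]
print(wl_rewards(first_moves, agent=2))        # [10, -10]
print(sum(wl_rewards(first_moves, agent=2)))   # 0
```

Removing agent 2 from the whole history, rather than only from the current step, is what produces the -10 at the second time step.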

Now the time-extended WLU for agent 2 can be computed
by summing the WLRs. This results in a WLU of 0, even
though agent 2 picks up a token weighted 10 (10 for t=1 and
-10 for t=2). The interpretation for this counter-intuitive
utility value is clear: because that token would have been
picked up by agent 1 at another time step, the net effect of
agent 2’s actions on the world utility was nil, resulting in
a WLU value of 0. Because moving to the right twice provides
a WLU value of 5 for agent 2, an agent optimizing its
payoff utility will take this second action. Similarly, agent 1
moving right twice will receive a WLU of 10. As this simple
example shows, each agent maximizing its WLU leads the
system to the world utility maximum where both tokens are
picked up.

Let us analyze the game-theoretic “equilibrium” solution
for WLU and SU in these two solutions. The SU is in a
Nash equilibrium for the first set of moves, in that neither
agent can improve its SU by unilaterally changing its
actions. Therefore, the system is “stuck” in this suboptimal
solution. Furthermore, even if the agents stumble upon the
second solution by accident, they will not remain there, as
this solution is unstable with payoff utilities given by SU.
Agent 2 can change its move (in future episodes) and
improve its payoff utility from 5 to 10. That this move reduces
agent 1’s utility from 10 to 0, and the world utility from 15
to 10, has no influence on agent 2’s actions. Note, however,
that agents with WLU as their payoff utilities are in an
equilibrium state in the second set of actions. They will
therefore seek this solution, as it offers higher payoff
utilities to each agent. The use of WLU has the net effect of
aligning the Nash equilibrium of the agents with the world
utility optimum, ensuring that when the agents optimize their
payoff utilities, the world utility is also at a local (and in
this case also the global) optimum.

[Figure 7 panels: per-agent WLU and SU values at t=1, t=2, and in total for the two sets of actions; the second panel is titled “Agents Taking Optimal Actions”.]

Figure 7: WLU Nash Equilibria and World Utility Optima
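
The equilibrium claims can be checked by brute force over each agent's four two-step plans. This is a sketch under assumptions, not the authors' code: the layout (squares 0 to 5, agents starting on squares 0 and 3, the token of value 10 on square 2 and the token of value 5 on square 5) is chosen only to match the moves described in the text, SU is taken to be the value of the tokens an agent picks up itself, and the names `su`, `wlu` and `is_nash` are hypothetical. Agents are indexed 0 and 1 in the code for agent 1 and agent 2.

```python
from itertools import product

N_SQUARES = 6
TOKENS = {2: 10, 5: 5}                    # assumed layout: square -> token value
STARTS = (0, 3)                           # assumed start squares of agent 1, agent 2
PLANS = list(product((-1, 1), repeat=2))  # left/right choices over two time steps


def path(start, plan):
    """Squares visited after each move, clipped to the six-square world."""
    squares, pos = [], start
    for move in plan:
        pos = min(max(pos + move, 0), N_SQUARES - 1)
        squares.append(pos)
    return squares


def pickups(joint, removed=None):
    """Token value collected by each agent; `removed` drops an agent from the history."""
    left = dict(TOKENS)
    gains = [0, 0]
    paths = [path(s, p) for s, p in zip(STARTS, joint)]
    for t in range(2):
        for i in range(2):
            if i != removed:
                gains[i] += left.pop(paths[i][t], 0)
    return gains


def su(joint, i):
    """Selfish utility (assumed form): value of the tokens agent i picks up itself."""
    return pickups(joint)[i]


def wlu(joint, i):
    """World utility with agent i minus world utility with agent i removed."""
    return sum(pickups(joint)) - sum(pickups(joint, removed=i))


def is_nash(joint, utility):
    """True if no agent can strictly raise its utility by changing only its own plan."""
    for i in range(2):
        for alt in PLANS:
            changed = tuple(alt if k == i else joint[k] for k in range(2))
            if utility(changed, i) > utility(joint, i):
                return False
    return True


RIGHT2, LEFT2 = (1, 1), (-1, -1)
first = (RIGHT2, LEFT2)    # agent 1 right twice, agent 2 left twice
second = (RIGHT2, RIGHT2)  # both agents right twice

print(is_nash(first, su), is_nash(second, su))    # True False
print(is_nash(first, wlu), is_nash(second, wlu))  # False True
print(wlu(second, 0), wlu(second, 1))             # 10 5
```

The check uses the weak notion of equilibrium (no strict unilateral improvement) and only tests the two solutions discussed in the text; it makes no claim that these are the only equilibria in the assumed layout.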

6. DISCUSSION 

In this work we focus on the problem of designing a collective
of autonomous agents that individually learn sequences
of actions such that the resultant sequence of joint actions
achieves a predetermined global objective. In particular, we
discuss the problem of controlling multiple agents in a grid
world, a problem related to many real world problems
including exploration vehicles trying to maximize aggregate
scientific data collection (e.g., rovers on the surface of Mars).
In this domain, we addressed the critical issue of what utility
functions those agents should strive to maximize,
extended previous results on collective intelligence to agents
attempting to maximize sequences of actions, and used
Q-learning with rewards set by COIN theory. Our results
demonstrate that RL rovers using COIN-derived goals
outperform both “natural” extensions of single agent algorithms
and global reinforcement learning solutions based on “team
games”.
Our investigations revealed an interesting situation where
the theoretically “best” strategy was not necessarily the best
approach in practice. Although AU is theoretically superior
to WLU (higher learnability), two issues prevent us from
fully exploiting its power. First, the “expected” action is
impossible to compute in a time extended setting, since even
a simple case where an agent has four actions and ten time
steps leads to 4^10 possible action sequences. Even Monte Carlo
sampling of such a space will yield highly inaccurate estimates
of the potential actions and their rewards. Second, estimating
the correct probability distributions over the possible
actions causes the utility values to change, creating a
self-consistency problem. To sidestep both issues, in this article
we chose to focus on the last time step (i.e., the current step for
the agent) and approximated the AU with the agent taking
each of the four actions possible in that time step with equal
likelihood. The resulting utility function provided good
solutions, but the performance of such a “handicapped” AU
did not exceed that of the conceptually simpler WLU.
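
The single-step approximation described above can be sketched as follows. The sketch assumes that AU takes the form of the world reward for the action actually taken minus the expected world reward over the agent's own actions, here with all four current-step actions equally likely. The function `approx_au_reward`, the action names and the toy reward values are hypothetical illustrations, not the rewards used in the experiments.

```python
from typing import Callable, Sequence, TypeVar

Action = TypeVar("Action")


def approx_au_reward(world_reward: Callable[[Action], float],
                     taken: Action,
                     actions: Sequence[Action]) -> float:
    """World reward for the taken action minus its uniform average over all actions.

    `world_reward(a)` is assumed to return the world reward at the current
    time step if this agent takes action `a` while every other agent's
    action is held fixed.
    """
    baseline = sum(world_reward(a) for a in actions) / len(actions)
    return world_reward(taken) - baseline


# Number of action sequences for four actions over ten time steps.
print(4 ** 10)  # 1048576

# Toy world rewards for one agent's four current-step actions (made-up values).
toy_reward = {"up": 0.0, "down": 0.0, "left": 10.0, "right": 5.0}
actions = list(toy_reward)
print(approx_au_reward(toy_reward.__getitem__, "left", actions))  # 6.25
```

Restricting the average to the current step needs only four evaluations of the world reward per agent per step, instead of an expectation over all 4^10 sequences.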

Future work in this area includes investigating efficient AU 
computation for sequences of actions, and investigating the 
“mean field” approximation to AU by clamping the actions 
of an agent to the average action discussed in Figure 2. This 
approach avoids both difficulties associated with the proper 
AU, and has been shown to lead to good world utility values 
in single step reward maximization problems [28]. 